Add vector exporter for semantic search embeddings #32
Merged
Adds a new 'vector' output format that converts SFS documents to vector embeddings suitable for semantic search and retrieval.

Key features:
- Applies temporal filtering (like md/html mode) to include only current regulations
- Intelligent text chunking (by paragraph, chapter, section, or semantic boundaries)
- OpenAI text-embedding-3-large model (best quality, 3072 dimensions)
- Multiple backend support: PostgreSQL/pgvector, Elasticsearch, JSON file
- Integrated into sfs_processor.py with CLI options

New files:
- exporters/vector/__init__.py - Module entry point
- exporters/vector/vector_export.py - Main export functionality
- exporters/vector/chunking.py - Document chunking strategies
- exporters/vector/embeddings.py - Embedding provider interface
- exporters/vector/backends/ - Vector store implementations

Usage:
  python sfs_processor.py --formats vector --vector-backend postgresql
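To illustrate the paragraph-based chunking strategy listed among the key features, here is a minimal sketch. It is not the code in exporters/vector/chunking.py: the function name `chunk_by_paragraph`, the `max_chars` parameter, and the greedy packing rule are all assumptions made for illustration.

```python
import re


def chunk_by_paragraph(text: str, max_chars: int = 1000) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    Hypothetical helper: the name, signature, and size limit are
    illustrative assumptions, not the PR's actual API.
    """
    # Split on one or more blank lines and drop empty fragments.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the limit: close the chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

A single paragraph longer than `max_chars` is kept whole in this sketch; a real implementation would likely split it further, for example at sentence boundaries.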
Add documentation for the new vector export format including:
- Overview of vector format in output formats section
- Temporal processing behavior for vector format
- CLI parameters for vector-specific options
- Dedicated section explaining semantic search use cases
- Backend comparison table (JSON, PostgreSQL, Elasticsearch)
- Usage examples with mock and production embeddings
JSON backend now saves vectors to output directory instead of repository root. Sets backend_config["file_path"] when backend_type is "json".

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
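The JSON backend path fix above could look roughly like the following sketch. Only `backend_config["file_path"]` and the `"json"` backend type come from the commit message; the function name `resolve_backend_config`, the `output_dir` parameter, and the file name `vectors.json` are assumptions.

```python
from pathlib import Path


def resolve_backend_config(backend_type: str, backend_config: dict, output_dir: str) -> dict:
    """Return a copy of the config with the JSON output path set.

    Hypothetical helper; the file name "vectors.json" is an assumed
    placeholder, not taken from the PR.
    """
    config = dict(backend_config)  # avoid mutating the caller's dict
    if backend_type == "json":
        # Write inside the output directory rather than the repository root.
        config["file_path"] = str(Path(output_dir) / "vectors.json")
    return config
```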
Standardize all metadata field names to English across vector export:
- ikraft_datum → effective_date (when regulation takes effect)
- utfardad_datum → issued_date (when regulation was issued)
- upphor_datum → expiration_date (when regulation expires)
- upphavd → repealed (if regulation is repealed)

Changes:
- Updated VectorRecord and DocumentChunk with English field names
- Modified PostgreSQL schema with English column names
- Updated Elasticsearch index mappings
- Added metadata normalization from Swedish to English
- Enhanced metadata extraction from both frontmatter and selex attributes
- All backends (JSON, PostgreSQL, Elasticsearch) now use consistent English fields

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
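The Swedish-to-English normalization described in this commit can be sketched as a simple key mapping. The four field pairs come from the commit message; the names `FIELD_MAP` and `normalize_metadata` are hypothetical, and the PR's actual implementation may differ.

```python
# Swedish -> English field names, as listed in the commit message.
FIELD_MAP = {
    "ikraft_datum": "effective_date",
    "utfardad_datum": "issued_date",
    "upphor_datum": "expiration_date",
    "upphavd": "repealed",
}


def normalize_metadata(metadata: dict) -> dict:
    """Rename known Swedish keys to English; leave other keys untouched.

    Hypothetical helper written for illustration only.
    """
    return {FIELD_MAP.get(key, key): value for key, value in metadata.items()}
```

Keys that are not in the mapping pass through unchanged, so metadata that is already in English is unaffected.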
Change date fields from TEXT to DATE type for proper date handling:
- effective_date: TEXT → DATE
- issued_date: TEXT → DATE
- expiration_date: TEXT → DATE

Elasticsearch already uses correct date type with format "yyyy-MM-dd||strict_date". This enables proper date queries, sorting, and range filtering in PostgreSQL.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
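With DATE columns, values need to be real dates by the time they reach the database. A minimal sketch of that conversion follows, assuming the metadata carries yyyy-MM-dd strings (as the Elasticsearch format suggests) and that missing values should become NULL. The helper name `parse_iso_date` is hypothetical.

```python
from datetime import date
from typing import Optional


def parse_iso_date(value: Optional[str]) -> Optional[date]:
    """Convert a 'yyyy-MM-dd' string to a date, or None when absent.

    Hypothetical helper; assumes ISO-formatted input and raises
    ValueError on anything else rather than storing bad data.
    """
    if value is None or value.strip() == "":
        return None
    return date.fromisoformat(value.strip())
```

Raising on malformed input is a design choice for this sketch; an exporter might prefer to log and skip such records instead.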